Statistics of collective human behaviors observed in blog entries 
I. INTRODUCTION 



A new field attracts the attention of physics
researchers when precise quantitative observation
becomes available. A good example is financial market
price fluctuation, which was not among physicists'
interests twenty years ago. However, the huge number
of precise observations, together with numerical
simulation techniques, led to a new field of applied
statistical physics named econophysics. More
recently, the sales figures for popular books have been
analyzed in view of the universal responses of complex
systems. Further, the increase in the number of
registrations for a conference as the deadline nears
has been studied and concluded to be a result of the
superposition of random processes caused by naive
human behaviors.

At an individual level, any human behavior may be
recognized as intentional and not random. However,
by observing a large number of people simultaneously,
we can expect the gross properties to appear random
owing to mutual independence. The first step in the
data analysis of human behaviors is to clarify to what
extent a real human system shows randomness by
comparing it with predictions based on the assumption
of independence. The next step is to establish an
empirical relation describing the gap between the
simple theory and observation. The third step is to
develop a new theory that captures a deeper human
action or hidden social interaction.

For example, for the time intervals between deals in
financial markets, an exponential distribution is
naturally expected, because a deal occurs when buy and
sell orders collide at the same price, which can be
modeled by a Poisson process. However, in reality we
generally observe a fat-tailed interval distribution.
Data analysis has clarified that the underlying Poisson
parameter changes on a time scale of a few minutes, and
the observed fat-tailed distribution is given by the
superposition of exponential distributions with various
characteristic scales. The human behavior underlying this
phenomenon is considered to be a general tendency 
of market participants to modulate their own clock 
speed to be proportional to a moving average of the 
market's transaction intervals. This effect, named
self-modulation, is known to induce long-term
correlation and to make the power spectrum of events
follow a 1/f spectrum. An agent-based numerical model
can reproduce this effect, and the resulting
transaction intervals are compatible with real data [6].

FIG. 1: Relation between temperature and number of
blog entries. (a) Time series of temperature and number
of blogs including the words "hot" and "cold". (b) Scatter
plot of temperature and number of blogs.
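
The superposition argument for transaction intervals can be illustrated numerically. The following is a minimal sketch, not the analysis used in the cited studies: the three rates, the sample sizes, and the helper names are assumptions chosen only for illustration. It compares the tail weight of a single exponential distribution with that of an equal-weight mixture of exponentials whose rates differ by factors of ten.

```python
import random


def interval_samples(rates, n_per_rate, rng):
    """Draw exponential waiting times, n_per_rate of them for each Poisson rate."""
    return [rng.expovariate(rate) for rate in rates for _ in range(n_per_rate)]


def tail_fraction(samples, multiple):
    """Fraction of samples larger than `multiple` times the sample mean."""
    mean = sum(samples) / len(samples)
    return sum(x > multiple * mean for x in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable

# One fixed Poisson rate: intervals are purely exponential.
single = interval_samples([1.0], 60000, rng)

# Hypothetical slowly changing rate, represented by three regimes.
mixed = interval_samples([0.1, 1.0, 10.0], 20000, rng)

single_tail = tail_fraction(single, 5)
mixed_tail = tail_fraction(mixed, 5)
print("single rate, P(interval > 5 * mean):", round(single_tail, 4))
print("mixed rates, P(interval > 5 * mean):", round(mixed_tail, 4))
```

For a single rate the probability of exceeding five times the mean is exp(-5), below one percent, whereas the mixture puts several percent of its weight there. This is the sense in which a changing Poisson parameter produces a fat tail.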

In this paper, we focus on the data analysis of blogs
to observe collective human behaviors. A blog is a
new type of social communication medium where personal
impressions can be easily uploaded to the web.
Blogs first appeared in the late 1990s and rapidly
gained popularity. The number of blogs has increased
every year since then.

A typical blog is maintained by an individual or
a small group. It consists of entries posted by users
that contain text and/or images and are frequently
accompanied by links to other web pages. One of
the characteristics of a blog is that readers can easily
leave their comments in an interactive format.
Consequently, the blog has evolved as a new type of
social communication tool for individuals.

Search engine technologies have been developed to
observe the details of blog entries automatically at
high speeds [1]. We focus on the temporal changes
in the frequency of a word on the web. For a given
keyword, the search engine automatically lists all blog
entries that include the keyword along with the time
of posting. According to Technorati, an internet
search engine company for blogs, the number of blogs
in the world is now more than 70 million and this
figure is still increasing. Categorizing blogs in terms
of language, we find that in 2007, 37% of the blogs were
in Japanese, 36% in English, 8% in Chinese, etc. [10].
Here, we observe Japanese blogs from January 1st,
2007 to December 31st, 2008 using a search engine
provided by Dentsu Buzz Research. This search
engine covers 20 major blog providers in Japan with
more than 10 million users. The total number of
observed entries exceeds 600 million; in other words,
on average, about 800 thousand new entries are
uploaded every day.

Here, we show an example in which blogs reflect
people's reactions to social and natural phenomena.
In Fig. 1 we show the relationship between the
average daily temperature in Tokyo and the number of
blog entries including the keywords "hot" and "cold".
In summer, in the middle of Fig. 1(a), it gets hot
and people tend to post blog entries with the keyword
"hot" more frequently. The scatter plot [Fig. 1(b)]
confirms that there is a nonlinear relation between
real temperature and number of entries.

II. BASIC STATISTICS 

A challenging application of blog analysis is the early
detection of epidemics. Ginsberg et al. observed
keywords such as "flu" by using the search engine query
data on the Google homepage and confirmed a positive
correlation between the actual spread and word
appearances [13]. Here, we focus on the more basic
properties of word frequency.

A pioneering basic study of blog word appearance
was done by Lambiotte et al., who analyzed
appearance intervals of common words [13]. It was
reported that a naive Poissonian assumption fails in
general, even for low-frequency words. However, two
of the authors (Y.S. and M.T.) recently examined
similar time series of blog word appearances carefully
and found that such deviation from the Poisson
process is caused by the following three effects:

1. Repeated spamming. 

2. Non-stationary changes caused by server break- 
downs or the growth in blog population. 

3. Periodic fluctuations during specific days of the 
week. 
FIG. 2: Relation between average $\langle F \rangle$ and
standard deviation $\sigma$ in log-log scale. The dashed
line corresponds to $\sigma = \sqrt{\langle F \rangle}$ and
the solid line to $\sigma = 0.11 \langle F \rangle$. (a)
Empirical result for the 271 words. (b) Comparison with
the analytical solution of Eq. (2) (bold solid line).



By introducing normalization processes to eliminate
these effects, we can confirm that, as expected
theoretically, low-frequency words actually follow the
Poisson process [14]. Interestingly, we still observe
deviation from the Poisson process for high-frequency
words. We focus on such high-frequency words.

In order to clarify the reason for this deviation from
the Poisson process, we introduce the following data
analysis. We randomly chose 300 words from a
dictionary for Japanese morphological analysis. We
discarded 29 of these words, namely those whose time
series has a peak more than 5 times higher than the
average or whose average is more than $10^4$. This
leaves 78 nouns, 36 verbs, 66 adverbs, 37 adjectives,
35 conjunctions, and 19 other types of words. For the
j-th word, we use the search engine to observe the
daily number of blogs $X_j(t)$ that contain the word.
Then, we apply the above-mentioned normalization
procedures to get a normalized blog number time series
$F_j(t)$. The average over the entire observation time
$\langle F_j \rangle$ and the standard deviation
$\sigma_j = \sqrt{\langle (F_j - \langle F_j \rangle)^2 \rangle}$
are plotted for each word in a log-log scale in Fig. 2(a).

If the appearance of a word is approximated by the
Poisson process, the standard deviation is given by the
square root of the average value, and a line with slope
1/2 is expected. For words with frequencies less than
about 80 entries per day, the left part of Fig. 2(a)
confirms this relation independently of the
morphological category. For these words, the
autocorrelations of intervals are confirmed to be almost
zero, and the frequency distributions pass statistical
tests of the Poisson distribution; namely, the
appearance of low-frequency words can be approximated by
the Poisson process. It should be noted that this
result is so delicate that it cannot be obtained without
the above normalization procedure, which might have
been missing in the preceding study [13].
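
The slope-1/2 expectation can be checked on synthetic data. The sketch below is an illustration under stated assumptions, not the procedure applied to the blog data: it draws stationary Poisson counts for five hypothetical daily rates over 731 days (the length of the observation period) and fits the log-log slope of standard deviation against average. The helper names and the choice of rates are assumptions.

```python
import math
import random
import statistics


def poisson_sample(mean, rng):
    """Knuth's multiplication method; adequate for the modest means used here."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = statistics.fmean(lx), statistics.fmean(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


rng = random.Random(1)  # fixed seed so the run is repeatable
days = 731              # January 1st, 2007 to December 31st, 2008
rates = [1, 3, 10, 30, 80]  # hypothetical low-frequency words (entries per day)

means, stds = [], []
for rate in rates:
    series = [poisson_sample(rate, rng) for _ in range(days)]
    means.append(statistics.fmean(series))
    stds.append(statistics.pstdev(series))

slope = loglog_slope(means, stds)
print("fitted log-log slope:", round(slope, 3))
```

A fitted slope close to 0.5 is what a pure Poisson process gives; a slope drifting toward 1 at large averages would signal the deviation discussed for high-frequency words.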

For more frequently appearing words, we find
a clear deviation from the simple Poisson process. As
shown in the right part of Fig. 2(a), instead of a
square root relation, a linear relation between standard
deviation and average holds empirically. Similar linear
relations between standard deviation and average are
observed in various fields, such as river flow
fluctuations, traffic fluctuations on highways [15], and
numerical fluctuations in cellular molecules [16]. In the
case of traffic fluctuations, this relation is understood
by a simple model named the Random Diffusion (RD)
model [13], where random walkers are randomly
injected into a given network. By observing the flow of
walkers at a node, the linear relation between average
flow strength and standard deviation is confirmed
for large values of flow. Here, we introduce a simple
model of posting blog entries as an application.

We assume that there is no spam and that
the system is stationary with no weekly period. There
are $n(t)$ bloggers who are actively posting blogs at
time $t$. Each blogger independently posts a blog
entry containing the j-th keyword with probability $c_j$.
The number of bloggers $n(t)$ is assumed to change
randomly in the range $[N - \Delta, N + \Delta]$, where
$\Delta$ is a positive constant. Thus, the mean number of
bloggers who post a blog entry containing the j-th word
at time $t$ is $F_j(t) = c_j n(t)$. Considering the case of
a uniformly distributed $n(t)$ for simplicity, we obtain
the probability density of the number of keyword
appearances $P_j(F_j)$ by taking the superposition of the
Poisson distribution.
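
The model just described can be simulated directly. The following sketch is only an illustration: the values of N, the half-width, the number of days, and the two target averages are assumptions, not fitted parameters, and the function names are hypothetical. For each day it draws the number of bloggers uniformly from the allowed range and then draws the keyword count from a Poisson distribution with mean $c_j n(t)$.

```python
import math
import random
import statistics


def poisson_sample(mean, rng):
    """Knuth's multiplication method; adequate for means of a few hundred."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_word(c_j, n_mean, n_halfwidth, days, rng):
    """Daily counts: n(t) uniform on [N - D, N + D], then Poisson with mean c_j * n(t)."""
    counts = []
    for _ in range(days):
        n_t = rng.randint(n_mean - n_halfwidth, n_mean + n_halfwidth)
        counts.append(poisson_sample(c_j * n_t, rng))
    return counts


rng = random.Random(2)  # fixed seed so the run is repeatable
N, DELTA, DAYS = 1000, 190, 3000  # illustrative values, not taken from data

# Ratio of the observed standard deviation to the pure-Poisson value sqrt(mean).
ratios = {}
for target_mean in (5, 400):
    counts = simulate_word(target_mean / N, N, DELTA, DAYS, rng)
    mean = statistics.fmean(counts)
    ratios[target_mean] = statistics.pstdev(counts) / math.sqrt(mean)
    print("average", round(mean, 1), "std / sqrt(average)", round(ratios[target_mean], 2))
```

A ratio near 1 means the word behaves like a plain Poisson process. In this sketch the low-frequency word stays near 1, while the high-frequency word is clearly above it because the fluctuation of $n(t)$ dominates.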
From this probability density, the standard deviation
$\sigma_j$ is expressed as a function of the average value
$\langle F_j \rangle$, and the relation for large $\Delta$
is approximated as follows.

$$\sigma_j = \sqrt{\langle F_j \rangle \left( 1 + \frac{\Delta^2}{3N^2} \langle F_j \rangle \right)} \qquad (2)$$
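
A relation of this form can be obtained from the law of total variance. The derivation below is a sketch under one added assumption, namely that $n$ is treated as a continuous variable uniformly distributed on $[N-\Delta, N+\Delta]$; the exact form of the density used by the authors is not reproduced here.

```latex
% Superposition of Poisson distributions over a uniform n (assumed continuous)
P_j(F) = \frac{1}{2\Delta} \int_{N-\Delta}^{N+\Delta}
         \frac{(c_j n)^{F} e^{-c_j n}}{F!} \, dn

% Mean: average of the conditional Poisson mean c_j n
\langle F_j \rangle = c_j \langle n \rangle = c_j N

% Variance: Poisson part plus the part carried by the fluctuation of n,
% using Var(n) = \Delta^2 / 3 for a uniform distribution of half-width \Delta
\sigma_j^2 = \langle c_j n \rangle + c_j^2 \, \mathrm{Var}(n)
           = c_j N + \frac{c_j^2 \Delta^2}{3}
           = \langle F_j \rangle + \frac{\Delta^2}{3N^2} \langle F_j \rangle^2
```

The first term dominates for small averages and gives the square-root law; the second dominates for large averages and gives a linear law. The two terms are equal at $\langle F_j \rangle = 3N^2/\Delta^2$.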
FIG. 3: Example of automatic detection of peculiar
fluctuations.



By fitting this theoretical estimation to the
empirical relation in Fig. 2(a), we can tune the model's
parameters. From Fig. 2(a), the bending point is
estimated as $\langle F \rangle_c = 82$, and from Eq. (2)
it is given as $\langle F \rangle_c = 3N^2/\Delta^2$; hence
we have $\Delta/N = 0.19$. The linear relation between
standard deviation and average in Eq. (2) becomes
$\sigma = 0.11 \langle F \rangle$, which automatically
fulfills the empirical relation as shown in Fig. 2(b).
This result demonstrates that the non-trivial relation
between standard deviation and average is a general
property caused by the daily fluctuations in the
number of bloggers.
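
The two numbers quoted in this fit follow from short arithmetic. The sketch below assumes that the variance has the form $\langle F \rangle + (\Delta/N)^2 \langle F \rangle^2 / 3$, which is what the quoted bending point $3N^2/\Delta^2$ implies; the variable names are illustrative.

```python
import math

crossover = 82.0  # bending point read from the empirical plot

# Bending point = 3 N^2 / Delta^2, so Delta / N = sqrt(3 / bending point).
delta_over_n = math.sqrt(3.0 / crossover)

# For large averages sigma ~ (Delta / N) / sqrt(3) * average.
slope = delta_over_n / math.sqrt(3.0)

print("Delta / N:", round(delta_over_n, 2))
print("sigma / average at large average:", round(slope, 2))
```

Both values round to the figures given in the text, 0.19 and 0.11, so the slope of the linear regime is fixed once the bending point is chosen and needs no additional parameter.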

Although the above theoretical analysis captures
the basic relation between the average number of
keywords and the standard deviations, the data points
in Fig. 2(a) scatter widely for words with large
averages. It is confirmed that this scattering cannot be
explained by the above superposition of stationary
Poisson processes, and we need to consider the effect
of temporal change of the appearance probability $c_j$.



III. APPLICATION 

As an application of this linear relation between
standard deviation and average, we introduce
a method of finding extraordinary events hidden in
fluctuations of word appearances in blogs. We pay
attention to the name of a new bread, which went on
the market on January 29th, 2008. The time series of
the daily number of blogs containing this new name
after normalization is given in Fig. 3. Before
December 26th, 2007, the number count of this keyword
was 0. Later, the number count grew rapidly with a
kind of oscillation. In order to remove the weekly
period, we introduce a 7-day moving average, which
is defined by

$$\bar{F}_j(t) = \frac{1}{2M+1} \sum_{\tau=-M}^{M} F_j(t+\tau), \qquad (3)$$

where M = 3. Then, we estimate the standard deviation
around this value based on the above-mentioned
empirical relation as follows:



$$\sigma_j(t) = \sqrt{\bar{F}_j(t) \left( 1 + \frac{\Delta^2}{3N^2} \bar{F}_j(t) \right)}, \qquad (4)$$

where $\Delta/N = 0.19$. The daily number count for this
keyword is mostly within the $2\sigma_j(t)$ band. However,
occasionally it is higher. We traced back the news
and found that external factors such as news items
released on those days did have an effect:

1. Jan. 2nd, 2008: One of the popular blogs
introduced the name of this new bread.

2. Jan. 9th, 2008: The company put out a news
release announcing the new bread.

3. Jan. 21st, 2008: The name of this new bread
was introduced in a TV program.

4. Jan. 29th, 2008: The new bread went on sale.

By using this empirical standard deviation, we can
successfully detect the peculiar fluctuations that were
apparently caused by the above external factors.
Furthermore, we can quantitatively evaluate the impact
of the company's press releases on people.
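
The detection procedure of this section can be sketched in a few lines. This is a hedged illustration, not the authors' code: the toy series is invented, the function names are hypothetical, the moving average is taken as centered with half-width M = 3, and the treatment of the first and last days (averaging over the days available) is an assumption.

```python
import math


def moving_average(series, half_window=3):
    """Centered (2M+1)-day mean; edge days average over the days available."""
    out = []
    for t in range(len(series)):
        lo = max(0, t - half_window)
        hi = min(len(series), t + half_window + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out


def band_sigma(mean_level, delta_over_n=0.19):
    """Standard deviation expected around a given moving-average level."""
    return math.sqrt(mean_level * (1.0 + mean_level * delta_over_n ** 2 / 3.0))


def flag_bursts(series, half_window=3, width=2.0):
    """Indices of days whose count exceeds the moving average by more than width * sigma."""
    smooth = moving_average(series, half_window)
    return [
        t
        for t, (count, level) in enumerate(zip(series, smooth))
        if count > level + width * band_sigma(level)
    ]


# Invented toy series: a flat level near 20 with one burst on day 7.
counts = [20, 22, 19, 21, 20, 23, 21, 60, 22, 20, 19, 21, 22, 20]
bursts = flag_bursts(counts)
print("days above the 2-sigma band:", bursts)
```

Because the moving average includes the burst day itself, very large bursts raise the local band; a variant that excludes the current day from the window would be more sensitive, at the cost of departing from the definition used in the text.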



IV. DISCUSSION 

In summary, we first showed the importance of
noise reduction for quantitative analysis of blog-word
frequencies. For low-frequency words, we confirmed
the Poisson hypothesis. For more frequent words, we
found that the effect of the change in the number of
contributors modifies the distribution drastically and
the standard deviation of blog-number fluctuations
becomes proportional to the average values. Analytical
solutions successfully reproduce the above basic
properties of blog-number fluctuations.

In this research we only observed Japanese blogs;
however, we expect that the analysis method and
the numerical model are directly applicable to other
languages. By establishing the basic observation
methods and simulation models, blog analysis will
become a powerful tool for the scientific study of
collective human behaviors in society.